Abstract: We investigate certain optimization problems for
Shannon information measures, namely, minimization of joint
and conditional entropies H(X,Y), H(X|Y), H(Y|X), and
maximization of mutual information I(X;Y), over convex
regions. When restricted to the so-called transportation polytopes
(sets of distributions with fixed marginals), very simple proofs
of NP-hardness are obtained for these problems because in
that case they are all equivalent, and their connection to the
well-known Subset sum and Partition problems is revealed.
The computational intractability of the more general problems
over arbitrary polytopes is then a simple consequence. Further,
a simple class of polytopes is shown over which the above
problems are not equivalent and their complexity differs sharply,
namely, minimization of H(X,Y) and H(Y|X) is trivial, while
minimization of H(X|Y) and maximization of I(X;Y) are
strongly NP-hard problems. Finally, two new (pseudo)metrics
on the space of discrete probability distributions are introduced,
based on the so-called variation of information quantity, and
NP-hardness of their computation is shown.

Index Terms: Entropy minimization, maximization of mutual
information, NP-complete, NP-hard, subset sum, partition,
number partitioning, transportation polytope, pseudometric,
variation of information.

I. Introduction 

Joint entropy H(X, Y), conditional entropies H(X|Y),
H(Y|X), and mutual information I(X;Y), are some of the
founding concepts of information theory. In the present paper 
we investigate some natural optimization problems associated 
with these functionals, namely, minimization of joint and 
conditional entropies and maximization of mutual information 
over convex polytopes, and show that all these problems 
are NP-hard. Certain special cases of these problems are 
found to represent information theoretic analogues of the well- 
known Subset sum and Partition problems. Our results 
will thus provide a simple, yet interesting connection between 
complexity theory and information theory. 

Various optimization problems for the above mentioned 
information measures are studied in the literature. An impor- 
tant example is the well-known Maximum entropy principle 
formulated by Jaynes f£|, which states that, among all proba- 
bility distributions satisfying certain constraints (expressing 
our knowledge about the system), one should pick the one 
with maximum entropy. It has been recognized by Jaynes, 
as well as many other researchers, that this choice gives the 
least biased, the most objective distribution consistent with 
the information one possesses about a system. Maximizing 



entropy under constraints is therefore an important problem, 
and it has been thoroughly studied (see, e.g., Q, |[T0lD . 

It has also been argued ifTTIl . lfT31 that minimum entropy 
distributions can be of as much interest as maximum entropy 
distributions. The MinMax information measure, for example, 
has been introduced in [11] as a measure of the amount of
information contained in a given set of constraints, and it is 
based both on maximum and minimum entropy distributions. 
More generally, entropy minimization is also very important 
conceptually. Watanabe [14] has shown that many algorithms
for clustering and pattern recognition can be characterized as 
suitably defined entropy minimization problems. 

Since entropy is a concave¹ functional, its maximization
can be solved by standard concave maximization methods.
On the other hand, concave minimization is in general a
much harder problem [12]. Indeed, we will show that the
minimization of joint entropy over convex polytopes is NP- 
hard. In fact, we will show that a much more restrictive 
problem is NP-hard, that of entropy minimization over the 
so-called transportation polytopes, i.e., entropy minimization 
under constraints on the marginal distributions. Restricting the 
problem to transportation polytopes is perhaps the key step 
in our analysis and has several advantages. First, it enables 
one to obtain a very simple proof of NP-hardness by using a 
reduction from the SUBSET SUM problem and some simple 
information theoretic identities and inequalities. Second, it 
will immediately follow from this proof that the problems 
of minimization of conditional entropies and maximization of 
mutual information are also NP-hard. This claim looks difficult 
to prove by some other methods because these functionals are 
neither concave nor convex. 

Maximization of mutual information is certainly an impor- 
tant problem, studied in many different scenarios. A familiar 
example is computing the capacity of the channel which 
amounts to the maximization of this functional over all input 
distributions. This is again a concave maximization problem
for which efficient algorithms exist 0. In Section [V] we 
will show that the reverse problem - maximizing mutual 
information over conditional distributions, given the input 
distribution - is NP-hard. Another important example is the 
so-called Maximum mutual information (MMI) criterion used 
in the design of classifiers. See, e.g., [1], [8] for two important
applications of this principle.

1 To avoid possible confusion, concave means ∩ and convex means ∪.

II. Basic definitions 

This section reviews the definitions and basic properties of 
the quantities that will be used later. 

A. Shannon information measures 

Shannon entropy of a random variable X with probability
distribution P = (p_i) is defined as:

$$H(X) = H(P) = -\sum_i p_i \log p_i \tag{1}$$

with the usual convention 0 log 0 = 0 being understood. For
a pair of random variables (X, Y) with joint distribution S =
(s_{i,j}) and marginal distributions P = (p_i) and Q = (q_j), the
following defines their joint entropy:

$$H(X,Y) = H_{X,Y}(S) = -\sum_{i,j} s_{i,j} \log s_{i,j}, \tag{2}$$

conditional entropy:

$$H(X|Y) = H_{X|Y}(S) = -\sum_{i,j} s_{i,j} \log \frac{s_{i,j}}{q_j}, \tag{3}$$

and mutual information:

$$I(X;Y) = I_{X;Y}(S) = \sum_{i,j} s_{i,j} \log \frac{s_{i,j}}{p_i q_j}, \tag{4}$$

again with appropriate conventions. All of these quantities are
related by simple identities:

$$H(X,Y) = H(X) + H(Y) - I(X;Y) = H(X) + H(Y|X) \tag{5}$$

and obey the following inequalities:

$$\max\{H(X), H(Y)\} \le H(X,Y) \le H(X) + H(Y), \tag{6}$$
$$\min\{H(X), H(Y)\} \ge I(X;Y) \ge 0, \tag{7}$$
$$0 \le H(X|Y) \le H(X). \tag{8}$$

Equalities on the right-hand sides of (6)-(8) are achieved
if and only if X and Y are independent. Equalities on the
left-hand sides of (6) and (7) are achieved if and only if X
deterministically depends on Y, or vice versa. Another way to
put this is that their joint distribution (written as a matrix) has
at most one nonzero entry in every row, or in every column.
Equality on the left-hand side of (8) holds if and only if X
deterministically depends on Y. We will use these properties
in our proofs. For their demonstration we point the reader to
the standard reference [3].

From identities (5) one can make the following simple, but
crucial, observation: Over a set of two-dimensional probability
distributions with fixed marginals (and hence fixed marginal
entropies H(X) and H(Y)), all the above functionals differ
only by an additive constant (and a minus sign in the case
of mutual information). This means in particular that the
minimization of joint entropy over such domains is equivalent
to the minimization of either one of the conditional entropies,
or to the maximization of mutual information. Therefore,
NP-hardness of any of these problems will imply that all of them
are NP-hard. And finally, this will imply that more general
problems of minimization/maximization of the corresponding
functionals over arbitrary convex polytopes are NP-hard.

B. Transportation polytopes

Let Γ_n^{(1)} and Γ_{n×m}^{(2)} denote the sets of one- and
two-dimensional probability distributions with alphabets of size n
and n × m, respectively:

$$\Gamma_n^{(1)} = \Big\{ (p_i) \in \mathbb{R}^n : p_i \ge 0, \ \sum_i p_i = 1 \Big\} \tag{9}$$
$$\Gamma_{n \times m}^{(2)} = \Big\{ (p_{i,j}) \in \mathbb{R}^{n \times m} : p_{i,j} \ge 0, \ \sum_{i,j} p_{i,j} = 1 \Big\} \tag{10}$$

Now consider the set of all distributions with marginals
P ∈ Γ_n^{(1)} and Q ∈ Γ_m^{(1)}, denoted C(P, Q):

$$C(P,Q) = \Big\{ S \in \Gamma_{n \times m}^{(2)} : \sum_j s_{i,j} = p_i, \ \sum_i s_{i,j} = q_j \Big\} \tag{11}$$

(letter C stands for coupling). It is easy to show that sets
C(P, Q) are convex and closed in Γ_{n×m}^{(2)}. They are also clearly
disjoint and cover the entire Γ_{n×m}^{(2)}, i.e., they form a partition
of Γ_{n×m}^{(2)}. Finally, they are parallel affine (n − 1)(m − 1)-
dimensional subspaces of the (n·m − 1)-dimensional space
Γ_{n×m}^{(2)}. (We of course have in mind the restriction of the
corresponding affine spaces in R^{n×m} to R_+^{n×m}.)

The set of distributions with fixed marginals is basically the
set of nonnegative matrices with prescribed row and column
sums (only now we require the total sum to be one, but this is
inessential). Such sets are known in discrete mathematics as
transportation polytopes [2]. Their name comes from the fact
that they correspond to the following problem: given n supplies
p_1, ..., p_n and m demands q_1, ..., q_m of some "goods"
(total supply and total demand being equal), describe all ways
of transporting the goods so that the demands are fulfilled. For
example, one possible solution to the transportation problem
with P = (1, 3, 5) and Q = (2, 4, 3) would be

$$\begin{pmatrix} 1 & 0 & 0 \\ 1 & 2 & 0 \\ 0 & 2 & 3 \end{pmatrix}$$

which is obtained by the so-called north-west corner rule. The
set of all possible solutions constitutes a polytope C(P, Q).

In the context of the problems studied here, the following
facts will be useful. A concave continuous function over any
bounded convex polytope must attain its minimum at one
of the vertices of the polytope. An interesting fact about
transportation polytopes is that, if the marginals are integer,
that is if p_i, q_j ∈ ℕ, then all the vertices (observed as matrices)
have only integer entries. As a consequence, if the marginals
are rational, that is if p_i, q_j ∈ ℚ, then all the vertices have
only rational entries. Furthermore, it is not hard to see that
the description length² of the vertices is polynomial in the
description length of the marginals. We shall make use of these
facts in the proofs of Theorems 1 and 3.
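The north-west corner rule mentioned above can be sketched in a few lines of Python. This is only an illustrative sketch; the function name `north_west_corner` and its argument names are our own choices, not notation from the paper. It returns one vertex of the transportation polytope, and for integer marginals every entry it produces is an integer, in line with the integrality property just stated.

```python
def north_west_corner(supplies, demands):
    """Return one solution (a vertex) of the transportation problem
    with the given row sums (supplies) and column sums (demands)."""
    if sum(supplies) != sum(demands):
        raise ValueError("total supply must equal total demand")
    rows, cols = len(supplies), len(demands)
    remaining_supply = list(supplies)
    remaining_demand = list(demands)
    plan = [[0] * cols for _ in range(rows)]
    i = j = 0
    while i < rows and j < cols:
        # Ship as much as possible in the current top-left free cell.
        shipped = min(remaining_supply[i], remaining_demand[j])
        plan[i][j] = shipped
        remaining_supply[i] -= shipped
        remaining_demand[j] -= shipped
        # Move down when the row is exhausted, otherwise move right.
        if remaining_supply[i] == 0:
            i += 1
        else:
            j += 1
    return plan


example = north_west_corner([1, 3, 5], [2, 4, 3])
print(example)  # [[1, 0, 0], [1, 2, 0], [0, 2, 3]]
```

The same function works unchanged for rational marginals if the inputs are `fractions.Fraction` values.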

III. Entropy over transportation polytopes 

As we noted before, we will focus here on sets of probability 
distributions with fixed marginals, i.e., we will consider the 
above mentioned optimization problems over transportation 
polytopes. The problems turn out to be NP-hard even under 
this restriction, and this is perhaps the easiest way to prove 
NP-hardness of their more general versions. 

Let some marginal distributions P and Q be given, and consider
C(P, Q). From identities (5) one sees that over C(P, Q)
the minimization of H(X, Y) is equivalent to the minimization
of H(X|Y) and H(Y|X), or to the maximization of I(X; Y),
so it is enough to consider only the joint entropy, for example.
Joint entropy H(X, Y) is well known to be concave in the
joint distribution, and so its minimization belongs to a wide
class of concave minimization problems which are in general
intractable [12]. Conditional entropies H(X|Y) and H(Y|X)
are neither concave nor convex in the joint distribution, but
over C(P, Q) they are concave because in that case they differ
from the joint entropy only by an additive constant. By the
same reasoning, mutual information is convex in the joint
distribution over C(P, Q). Based on concavity one concludes
that the optimizing distribution for these problems must be one
of the vertices of C(P, Q). The trouble with concave minimization,
of course, is that one may have to visit all of the vertices, or at
least a "large" portion of them, to decide where the minimum is.
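The equivalence of the four functionals over C(P, Q) can be checked numerically on a small example. The sketch below is illustrative only: the helper name `measures` and the two test matrices are assumptions of ours. It evaluates joint entropy, both conditional entropies, and mutual information directly from their defining sums (in bits), for two different couplings of the same uniform binary marginals.

```python
import math


def measures(joint):
    """Return (H(X,Y), H(X|Y), H(Y|X), I(X;Y)) in bits for a joint matrix."""
    p = [sum(row) for row in joint]          # marginal of X (row sums)
    q = [sum(col) for col in zip(*joint)]    # marginal of Y (column sums)
    h_xy = h_x_given_y = h_y_given_x = mi = 0.0
    for i, row in enumerate(joint):
        for j, s in enumerate(row):
            if s > 0:  # convention 0 log 0 = 0
                h_xy -= s * math.log2(s)
                h_x_given_y -= s * math.log2(s / q[j])
                h_y_given_x -= s * math.log2(s / p[i])
                mi += s * math.log2(s / (p[i] * q[j]))
    return h_xy, h_x_given_y, h_y_given_x, mi


# Two couplings with the same marginals P = Q = (1/2, 1/2).
independent = measures([[0.25, 0.25], [0.25, 0.25]])
diagonal = measures([[0.5, 0.0], [0.0, 0.5]])

# H(X,Y) - H(X|Y) = H(Y) and H(X,Y) + I(X;Y) = H(X) + H(Y) are the same
# constants for both couplings.
print(independent[0] - independent[1], diagonal[0] - diagonal[1])
print(independent[0] + independent[3], diagonal[0] + diagonal[3])
```

Because these differences depend only on the marginals, a coupling that minimizes H(X,Y) also minimizes both conditional entropies and maximizes I(X;Y).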

A. The computational problems 

The most general form of the problem in the context studied
here would be the following: Given a polytope (by a system
of inequalities, say) in Γ_{n×m}^{(2)}, find the distribution (matrix)
S which minimizes the entropy functional H_{X,Y}(S). The
decision version of this problem is obtained by giving some
threshold h at the input, and asking whether a given polytope
contains a distribution S with H_{X,Y}(S) ≤ h.

Let us now restrict the problem to transportation polytopes.
Since:

$$H(X,Y) \ge \max\{H(X), H(Y)\}, \tag{12}$$

over the transportation polytope C(P, Q) we have:

$$H(X,Y) \ge \max\{H(P), H(Q)\} \tag{13}$$

with equality if and only if Y deterministically depends on
X, or vice versa [3]. In other words, we will have equality
in (13) iff the joint distribution is such that it has at most
one nonzero entry in every row, or in every column. We can
now formulate an even more restrictive problem, with the
threshold specified in advance: Given a transportation polytope
C(P, Q) ⊂ Γ_{n×m}^{(2)}, is there a distribution S in this polytope with
H_{X,Y}(S) ≤ H(P)? Note that, because of (13), this inequality
must in fact be an equality. We name this problem ENTROPY
MINIMIZATION even though the name would probably be more
appropriate for the more general problems mentioned above.

2 By the description length of an object, we mean the number of bits
required to write it down, as usual in the context of algorithmic problems.

Problem: ENTROPY MINIMIZATION

Instance: Positive rational numbers p_1, ..., p_n and
q_1, ..., q_m, with ∑_{i=1}^n p_i = ∑_{j=1}^m q_j = 1.
Question: Is there a matrix S ∈ C(P, Q) with entropy
H_{X,Y}(S) = H(P)?

This problem will be shown to be NP-complete.

We also briefly note here that the corresponding problem of
the maximization of H(X,Y) (maximization of H(X|Y) or
minimization of I(X;Y)) over C(P, Q) is trivial because:

$$H(X,Y) \le H(X) + H(Y) \tag{14}$$

with equality if and only if X and Y are independent [3],
i.e., iff their joint distribution is P × Q = (p_i q_j), and this
distribution clearly belongs to C(P, Q).
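As a quick numerical illustration of this remark, the following sketch builds the product coupling P × Q and confirms that its joint entropy equals H(P) + H(Q). The helper names `entropy` and `product_coupling` and the example marginals are our own assumptions, chosen only for illustration; entropies are in bits.

```python
import math


def entropy(probabilities):
    """Shannon entropy in bits, with the convention 0 log 0 = 0."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


def product_coupling(p, q):
    """Joint distribution of independent X ~ P and Y ~ Q."""
    return [[pi * qj for qj in q] for pi in p]


p, q = [0.5, 0.25, 0.25], [0.75, 0.25]
s = product_coupling(p, q)
joint_entropy = entropy([x for row in s for x in row])

# The upper bound in (14) is attained by the product coupling.
print(math.isclose(joint_entropy, entropy(p) + entropy(q)))  # True
```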

B. The proof of NP-hardness 

We describe first the SUBSET SUM problem, a well-known
NP-complete problem [6], which will be the basis of the proof
to follow:

Problem: SUBSET SUM

Instance: Positive integers d_1, ..., d_n and s.

Question: Is there a J ⊆ {1, ..., n} such that
∑_{j∈J} d_j = s?

Theorem 1: ENTROPY MINIMIZATION is NP-complete.
Proof: We shall demonstrate a reduction from the SUBSET
SUM problem to the ENTROPY MINIMIZATION problem.
Let there be given an instance of the SUBSET SUM problem,
i.e., a set of positive integers s, d_1, ..., d_n, n ≥ 2. Let
D = ∑_{i=1}^n d_i, and let p_i = d_i/D, q = s/D. (We may assume
s < D, since otherwise the instance is trivial.) The question we
are trying to answer is whether there is a J ⊆ {1, ..., n} such
that ∑_{j∈J} d_j = s. Observe that this is equivalent to asking
whether there is a matrix S with row sums P = (p_1, ..., p_n)
and column sums Q = (q, 1 − q), which has at most one
nonzero entry in every row (or, in probabilistic language,
such that Y deterministically depends on X): the rows whose
mass lies in the first column form the set J, and the first column
sums to q precisely when ∑_{j∈J} d_j = s. We know that
in this case, and only in this case, the entropy of S would
be equal to H(P) [3]. So if we create an instance of the
ENTROPY MINIMIZATION problem with P and Q as above,
the answer to the question whether there exists S ∈ C(P, Q)
with H_{X,Y}(S) = H(P) will solve the SUBSET SUM problem.
Therefore, this is the reduction we wanted. It is left to prove
that ENTROPY MINIMIZATION belongs to NP. This is done
by using the familiar characterization of the class NP via
certificates [13]. We have to show that every YES-instance of
the problem has a succinct certificate, while no NO-instance
has one, and that the validity of the alleged certificates can
be verified in polynomial time. The certificate is of course
the optimizing distribution itself. That it is succinct is easy to
show (see the comment in the last paragraph of Section II-B),
and polynomial time verifiability is even easier, because we
only have to check that S belongs to C(P, Q) and that it has
at most one nonzero entry in every row. ■
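The reduction and the certificate check from the proof can be sketched as follows. This is a minimal illustration under stated assumptions, not an algorithm from the paper: the function names are hypothetical, the sketch assumes 0 < s < D, and `find_certificate` is a deliberately naive exhaustive search over all 2^n deterministic assignments (so it runs in exponential time, as one would expect for an NP-complete problem). Exact rational arithmetic is used so that the polynomial-time check in `is_certificate` involves no rounding.

```python
from fractions import Fraction
from itertools import product


def subset_sum_to_entropy_instance(d, s):
    """Map a SUBSET SUM instance (d, s) to marginals (P, Q)."""
    total = sum(d)
    if not 0 < s < total:
        raise ValueError("this sketch assumes 0 < s < sum(d)")
    p = [Fraction(di, total) for di in d]
    q = [Fraction(s, total), 1 - Fraction(s, total)]
    return p, q


def is_certificate(matrix, p, q):
    """Check that S is in C(P, Q) and has at most one nonzero entry per row."""
    if len(matrix) != len(p) or any(len(row) != len(q) for row in matrix):
        return False
    if any(x < 0 for row in matrix for x in row):
        return False
    if [sum(row) for row in matrix] != list(p):
        return False
    if [sum(col) for col in zip(*matrix)] != list(q):
        return False
    return all(sum(1 for x in row if x != 0) <= 1 for row in matrix)


def find_certificate(p, q):
    """Exhaustive search over deterministic couplings (exponential time)."""
    for columns in product(range(len(q)), repeat=len(p)):
        matrix = [[pi if j == c else Fraction(0) for j in range(len(q))]
                  for pi, c in zip(p, columns)]
        if is_certificate(matrix, p, q):
            return matrix
    return None


yes_instance = subset_sum_to_entropy_instance([3, 5, 7, 2], 10)  # 3 + 7 = 10
no_instance = subset_sum_to_entropy_instance([3, 5, 7, 2], 4)    # no subset sums to 4
print(find_certificate(*yes_instance) is not None)  # True
print(find_certificate(*no_instance) is None)       # True
```

Note that verifying a certificate never requires evaluating a logarithm: membership in C(P, Q) and the one-nonzero-per-row condition are purely arithmetic checks.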



As a straightforward consequence of the above claim, the 
more general problem of finding a minimizing distribution 
is NP-hard. It is an interesting task to determine the precise 
complexity of this problem (in the sense of proving that it is 
complete for some natural complexity class). Note that even 
determining whether it belongs to FNP³ is nontrivial. Whether
the decision version of this problem, namely, deciding whether 
a given polytope contains a distribution with entropy smaller 
than a given threshold, belongs to NP is also an interesting 
question (which we shall not be able to resolve here). One 
has to be careful when reasoning about "certificates" for these 
problems. Namely, one has to be able to check in polynomial time
that the certificate is indeed valid. In the above proof, we only 
had to check that the given matrix (the alleged certificate) 
belongs to C(P,Q) (i.e., that it has nonnegative entries and 
prescribed row and column sums) and that it has at most 
one nonzero entry in every row, and this is clearly easy to 
do. But in the more general problems mentioned above, one 
is required to compute numbers of the form a log a to check 
whether H(T) < h for example. These numbers are in general 
irrational, and therefore verifying this inequality might not be as
computationally trivial as it seems. It is interesting to
mention in this context the so-called SQRT SUM problem: 

Problem: SQRT SUM

Instance: Positive integers d_1, ..., d_n, and k.
Question: Is ∑_{i=1}^n √d_i ≤ k?

This problem, though "conceptually simple" and bearing a
certain resemblance to the checking of certificates in the general
versions of the entropy minimization problem, is not known
to be solvable in NP (it is solvable in PSPACE).

IV. TWO PSEUDOMETRICS WHICH ARE HARD TO COMPUTE 

A variant of the minimization of the above mentioned
quantities produces a distance on the space of discrete probability
distributions. For a pair of random variables (X, Y) with joint
distribution S, define [4]:

$$A(X,Y) = A(S) = H(X,Y) - I(X;Y) = H(X|Y) + H(Y|X), \qquad A'(X,Y) = A'(S) = 1 - \frac{I(X;Y)}{H(X,Y)}. \tag{15}$$

The quantity A(X,Y) is sometimes called the variation of 
information. Its normalized variant, A'(X,Y), is basically an 
information theoretic analogue of the Jaccard distance between 
finite sets. Both of these quantities satisfy the properties of 
a pseudometric Q. However, when this statement is made, 
one must assume that the joint distribution of (X, Y) is given 
because joint entropy and mutual information are not defined 
otherwise. This is usually overlooked in the literature. Further- 
more, if these quantities are used as distance measures on the 
space of all random variables, then joint distributions of every 
pair of random variables must be given. For example, one
could first define some random process (X_t) and then take A
or A' as distances between the random variables X_t. In order
to avoid the dependence on the chosen random process (or
on some universal joint distribution), and to define a distance
between individual random variables (more precisely, between
their distributions) one can make the following definitions:

$$A(P,Q) = \inf_{S \in C(P,Q)} \{ H_{X,Y}(S) - I_{X;Y}(S) \}, \qquad A'(P,Q) = \inf_{S \in C(P,Q)} \left\{ 1 - \frac{I_{X;Y}(S)}{H_{X,Y}(S)} \right\}. \tag{16}$$

This definition mimics the one for the total variation distance:

$$d_V(X,Y) = \inf_{C(P,Q)} \mathbb{P}(X \ne Y) \tag{17}$$

where the infimum is taken over all joint distributions of the
random vector (X, Y) with marginals P and Q.

3 The class FNP captures the complexity of function problems associated
with decision problems in NP, see [13].
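For a single, given joint distribution S, the two quantities that are minimized in (16) are straightforward to evaluate; the hardness lies entirely in the infimum over C(P, Q). The following sketch is illustrative only: the function name is ours, entropies are in bits, and returning 0 for the normalized variant when H(X,Y) = 0 is a convention we assume for the degenerate case.

```python
import math


def variation_of_information(joint):
    """Return (A(S), A'(S)) = (H(X,Y) - I(X;Y), 1 - I(X;Y)/H(X,Y)) in bits."""
    p = [sum(row) for row in joint]
    q = [sum(col) for col in zip(*joint)]
    h_xy = mi = 0.0
    for i, row in enumerate(joint):
        for j, s in enumerate(row):
            if s > 0:  # convention 0 log 0 = 0
                h_xy -= s * math.log2(s)
                mi += s * math.log2(s / (p[i] * q[j]))
    # Assumed convention: A' = 0 when both variables are deterministic.
    normalized = 1 - mi / h_xy if h_xy > 0 else 0.0
    return h_xy - mi, normalized


# X = Y (diagonal coupling): both distances vanish.
print(variation_of_information([[0.5, 0.0], [0.0, 0.5]]))      # (0.0, 0.0)
# Independent uniform bits: A = H(X|Y) + H(Y|X) = 2 bits, A' = 1.
print(variation_of_information([[0.25, 0.25], [0.25, 0.25]]))  # (2.0, 1.0)
```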

Let Γ^{(1)} = {(p_i)_{i∈ℕ} : p_i ≥ 0, ∑_i p_i = 1}. We have the
following.

Proposition 1: A and A' are pseudometrics on Γ^{(1)}.
The proof of this proposition is not difficult but we omit it
here since it is not essential for our current aims. We can now
prove one more intractability result.

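Theorem 2: Given rational P and Q, determining whether
A(P, Q) = H(P) − H(Q) is NP-hard.
Proof: Note that

$$A(P,Q) = \inf_{S \in C(P,Q)} \{ H_{X,Y}(S) - I_{X;Y}(S) \} = 2 \inf_{S \in C(P,Q)} \{ H_{X,Y}(S) \} - H(P) - H(Q). \tag{18}$$

Hence A(P, Q) = H(P) − H(Q) if and only if the infimum of
H_{X,Y}(S) over C(P, Q) equals H(P), and the claim follows
directly from Theorem 1. ■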

V. One marginal fixed 

In this section we address problems similar to those considered
before, but now we fix only one of the marginal distributions, say
P = (p_1, ..., p_n). If the cardinality of the alphabet of the
other random variable Y is not specified, then the problems are
trivial. Namely, one takes Q = P and for S = diag(P)
(two-dimensional distribution with masses p_i on the diagonal and
zeros elsewhere) one has H_{X,Y}(S) = I_{X;Y}(S) = H(P), and
hence S is optimal. So assume that the cardinality of the other
alphabet is bounded by m. Denote the set of all distributions
with marginal distribution of X fixed to P and the cardinality
of the alphabet of Y fixed to m, by C(P, m). We have

$$C(P,m) = \bigcup_{Q \in \Gamma_m^{(1)}} C(P,Q). \tag{19}$$



Minimization of the joint entropy H(X, Y) over such
polytopes is trivial. The reason is that H(X, Y) ≥ H(P)
with equality iff Y deterministically depends on X, and so
the solution is any joint distribution having at most one
nonzero entry in each row. Since H(X) is fixed, this also
minimizes the conditional entropy H(Y|X). The other two
optimization problems considered so far, minimization of
H(X|Y) and maximization of I(X;Y), are still equivalent
because I(X;Y) = H(X) − H(X|Y), but they turn out to
be much harder. Therefore, in the following we shall consider
only the maximization of I(X; Y).

When one marginal is fixed, choosing the optimal joint
distribution amounts to choosing the optimal conditional
distribution p(y|x). Mutual information I(X; Y) is known to be
convex in the conditional distribution [3] (and hence, H(X|Y)
is concave in p(y|x), for fixed p(x)) and so this is again a
convex maximization problem. This conditional distribution
can be thought of as a discrete memoryless communication
channel with n input symbols and m output symbols, and
hence we name the corresponding computational problem
OPTIMAL CHANNEL.

Problem: OPTIMAL CHANNEL

Instance: Positive rational numbers p_1, ..., p_n with
∑_{i=1}^n p_i = 1, and an integer m.
Question: Is there a channel C ∈ C(P, m) with mutual
information I_{X;Y}(C) ≥ log m?

Note that the above inequality must in fact be an equality
because over C(P, m):

$$I(X;Y) \le \min\{H(P), \log m\} \tag{20}$$

which follows from (7) and the fact that H(Y) ≤ log m.
As with ENTROPY MINIMIZATION, the above problem is a
suitable restriction of the more general problem of finding a
maximizing distribution.

A. The proof of NP-hardness

We describe next the well-known PARTITION (or NUMBER
PARTITIONING) problem [6].

Problem: PARTITION
Instance: Positive integers d_1, ..., d_n.
Question: Is there a partition of {d_1, ..., d_n} into two
subsets with equal sums?

This is clearly a special case of the SUBSET SUM problem.
It can be solved in pseudo-polynomial time by dynamic
programming methods [6]. But the following closely related
problem is much harder.
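The pseudo-polynomial dynamic program alluded to here can be sketched as follows. This is the standard textbook approach rather than anything specific to the paper, and the function name `can_partition` is our own. Its running time is polynomial in n and in the sum of the d_i, i.e., polynomial in the magnitude of the numbers rather than in their bit length.

```python
def can_partition(d):
    """Decide PARTITION by tracking which subset sums are reachable.

    Runs in O(n * sum(d)) time, which is pseudo-polynomial.
    """
    total = sum(d)
    if total % 2:
        return False  # an odd total cannot be split into two equal halves
    target = total // 2
    reachable = {0}  # subset sums achievable with the items seen so far
    for value in d:
        reachable |= {r + value for r in reachable if r + value <= target}
    return target in reachable


print(can_partition([3, 1, 1, 2, 2, 1]))  # True, e.g. 3 + 2 = 1 + 1 + 2 + 1
print(can_partition([1, 2, 5]))           # False
```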

Problem: 3-PARTITION

Instance: Nonnegative integers d_1, ..., d_{3m} and k with
k/4 < d_j < k/2 and ∑_j d_j = mk.

Question: Is there a partition of {1, ..., 3m} into
m subsets J_1, ..., J_m (disjoint and covering
{1, ..., 3m}) such that the sums ∑_{j∈J_i} d_j are all
equal? (The sums are necessarily k and every
J_i has 3 elements.)

This problem is NP-complete in the strong sense [6], i.e., no
pseudo-polynomial time algorithm for it exists unless P=NP.

The following theorem will establish that, given an infor- 
mation source, determining the best channel (in the sense of 
having the largest mutual information) is NP-hard. 

Theorem 3: OPTIMAL CHANNEL is NP-complete.
Proof: We prove the claim by reducing 3-PARTITION to
OPTIMAL CHANNEL. Let there be given an instance of the
3-PARTITION problem as described above, and let p_i = d_i/D
where D = ∑_j d_j = mk. Deciding whether there exists a partition
with the described properties is clearly equivalent to deciding
whether there is a matrix C ∈ C(P, m) with the other marginal
Q being uniform and C having at most one nonzero entry in
every row (i.e., Y deterministically depending on X): the columns
of such a matrix play the role of the subsets J_1, ..., J_m, and
each column sums to 1/m precisely when the corresponding d_j
sum to D/m = k. This on
the other hand happens if and only if there is a matrix C ∈
C(P, m) with mutual information equal to H(Q) = log m.
Therefore, solving the OPTIMAL CHANNEL problem with
instance (p_i) as above will solve the 3-PARTITION problem.
This shows the NP-hardness of OPTIMAL CHANNEL. It is left
to prove that it belongs to NP. The reasoning here is completely
analogous to the one in the proof of Theorem 1, namely, the
certificate is the optimal distribution/matrix itself. ■
The problem remains NP-complete even over C(P, 2), i.e.,
when the cardinality of the channel output is fixed in advance
to 2. In that case the problem is equivalent to the PARTITION
problem.

It is easy to see that the transformation in the proof of
Theorem 3 is in fact pseudo-polynomial [6], which implies
that OPTIMAL CHANNEL is strongly NP-complete and, unless
P=NP, has no pseudo-polynomial time algorithm.
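The correspondence used in the proof of Theorem 3 can be illustrated on a toy 3-PARTITION instance. The sketch below is an illustration under our own assumptions: the function name, the instance d = (6, 6, 8, 7, 7, 6) with m = 2 and k = 20, and the two assignments are hypothetical examples. It relies on the fact that for a deterministic channel H(Y|X) = 0, so I(X;Y) = H(Y); the mutual information therefore reaches log m exactly when all groups have equal sums.

```python
import math
from fractions import Fraction


def deterministic_channel_information(d, assignment, m):
    """I(X;Y) in bits for the channel mapping input i to output assignment[i],
    with input distribution p_i = d_i / sum(d)."""
    total = sum(d)
    output_mass = [Fraction(0)] * m
    for di, y in zip(d, assignment):
        output_mass[y] += Fraction(di, total)
    # Y is a function of X, so H(Y|X) = 0 and I(X;Y) = H(Y).
    return -sum(float(q) * math.log2(float(q)) for q in output_mass if q > 0)


d, m = [6, 6, 8, 7, 7, 6], 2          # k = 20 and 5 < d_j < 10
balanced = deterministic_channel_information(d, [0, 0, 0, 1, 1, 1], m)    # sums 20 and 20
unbalanced = deterministic_channel_information(d, [0, 0, 1, 0, 1, 1], m)  # sums 19 and 21

print(math.isclose(balanced, math.log2(m)))  # True
print(unbalanced < math.log2(m))             # True
```

The numbers in the constructed instance are just the d_i divided by D, which is why the transformation preserves magnitudes up to a polynomial factor.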